Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-6234: Increased timeout value for lowWatermark response to avoid test failing occasionally #4238

Merged
merged 4 commits into from
Apr 12, 2018

Conversation

soenkeliebau
Copy link
Contributor

Increase timeout to fix flaky integration test testLogStartOffsetCheckpoint.

@asfgit
Copy link

asfgit commented Nov 20, 2017

SUCCESS
8083 tests run, 5 skipped, 0 failed.
--none--

@asfgit
Copy link

asfgit commented Nov 20, 2017

FAILURE
7975 tests run, 5 skipped, 1 failed.
--none--

@asfgit
Copy link

asfgit commented Nov 20, 2017

SUCCESS
8083 tests run, 5 skipped, 0 failed.
--none--

Copy link
Member

@ijuma ijuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, left one comment.

@@ -759,7 +759,7 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging {

val future = result.lowWatermarks().get(topicPartition)
try {
lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark()
lowWatermark = future.get(5000L, TimeUnit.MILLISECONDS).lowWatermark()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the get method without a timeout. waitUntilTrue has its own timeout, so we don't need to have another one here.

@soenkeliebau
Copy link
Contributor Author

Thanks for the comment Ismael, that does indeed make much more sense. Have updated the PR.

@asfgit
Copy link

asfgit commented Nov 20, 2017

FAILURE
7975 tests run, 5 skipped, 2 failed.
--none--

@asfgit
Copy link

asfgit commented Nov 20, 2017

SUCCESS
8083 tests run, 5 skipped, 0 failed.
--none--

1 similar comment
@asfgit
Copy link

asfgit commented Nov 20, 2017

SUCCESS
8083 tests run, 5 skipped, 0 failed.
--none--

@soenkeliebau
Copy link
Contributor Author

Not sure how that test failure came to pass, the exception should have been caught and retried. Apparently the test has transient failures for more than one scenario. Is this potentially a mixup of the scala and java versions of LeaderNotAvailableExceptions and thus the catch is ignored?

@ijuma
Copy link
Member

ijuma commented Nov 20, 2017

The exception type is ExecutionException and LeaderNotAvailableException is the cause. The test needs to be fixed.

@@ -759,7 +759,7 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging {

val future = result.lowWatermarks().get(topicPartition)
try {
lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark()
lowWatermark = future.get().lowWatermark()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this @soenkeliebau . I agree with your reasoning on the JIRA. But I think as for the fix it is better to still set the timeout which is equal to the default value of waitUntilTrue (it is 15secs), because otherwise we may be blocked more than what is specified as the limit in waitUntilTrue, and then still catch the TimeoutException internally and return false.

cc @cmccabe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we catch the TimeoutException internally? Using a timeout in get is a bit of an anti-pattern, set the request timeout in AdminClient appropriately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Regarding set the timeout: that's right, as long as we set the timeout config in adminClient that should be sufficient, that future.get() without parameter would still throw after configured timeout.
  2. We should still catch TimeoutException in future.get() and return false;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some code to unnest the LeaderNotAvailableException and rethrow if anything else is nested. Also catch the TimeoutException and return false.
As far as I can tell, timeouts are set to 20 in AdminClient and defaults to 15 for waitUntilTrue, so the Timeout should never occur and we are fine - if it does, we handle it.
I'm not sure if in this case it wouldn't be better to explicitly state the timeouts though, to make the purpose of the test easier to understand without digging through default timeout values. Then again, I could just add a comment to that effect :) I'll wait for your input before I push more commits..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to catch the TimeoutException since it's longer than waitUntilTrue and it's a different failure mode than if waitUntilTrue times out because LeaderNotAvailableException happened for too long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've removed the check for TimeoutException, since Guozhang hasn't weighed in again and he was advocationg for catching it.

@@ -759,7 +759,7 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging {

val future = result.lowWatermarks().get(topicPartition)
try {
lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark()
lowWatermark = future.get().lowWatermark()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to catch the TimeoutException since it's longer than waitUntilTrue and it's a different failure mode than if waitUntilTrue times out because LeaderNotAvailableException happened for too long.

lowWatermark == 5L
} catch {
case e: LeaderNotAvailableException => false
case e: TimeoutException => false
case e: ExecutionException => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do this as:

case e: ExecutionException if e.getCause == LeaderNotAvailableException => false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've refactored as per your suggestion, however the == didn't work as comparison for me since "LeaderNotAvailableException is not a value", so I changed it to isInstanceOf[].

@ijuma
Copy link
Member

ijuma commented Nov 23, 2017

retest this please

1 similar comment
@ijuma
Copy link
Member

ijuma commented Nov 23, 2017

retest this please

@soenkeliebau
Copy link
Contributor Author

I had high hopes this time :)

@ijuma
Copy link
Member

ijuma commented Nov 23, 2017

retest this please

case e: LeaderNotAvailableException => false
}

case e: ExecutionException if e.getCause.isInstanceOf[LeaderNotAvailableException] => false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is failing when the cause is NotLeaderForPartitionException. Is that the exception that you meant to use? Or do we need to catch both?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've also seen it fail with LeaderNotAvailableException in the very beginning, so we probably need to catch both exceptions here. To be honest I never really questioned the original code that caught only LeaderNotAvailableException and simply fixed unnesting the exception.

Perhaps it would make sense to check for RetriableException instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the code to test for a retriable exception and as far as I can tell it does what it is supposed to do. This should also cover all other relevant cases that allow us to resubmit the delete request I think. For anything else it is ok to fail the test I think.

I've also rebased and squashed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should stick to checking the two leader related exceptions since those are the only ones we expect to be thrown in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to change of course. I do wonder whether we might make the test too narrow though. If we encounter a retriable exception, retry and get the correct result, should we really fail the test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we are throwing a retriable exception that is unexpected, the test should fail. We can then check if we made a mistake in the test or the code.

@soenkeliebau
Copy link
Contributor Author

Not sure what happened with the jdk7 test failure, it was this test case failing, but didn't look like an exception issue when I glanced at it, currently jenkins won't speak to me, so can't look in detail, will check back later.

@ijuma
Copy link
Member

ijuma commented Nov 24, 2017

The latest error was because the low watermark was not 5. We should include the actual watermark in the error message. I'm going to do a separate PR to disable this test while we try to figure it out.

ijuma added a commit to ijuma/kafka that referenced this pull request Nov 24, 2017
It's failing often and it seems like there are multiple
reasons. PR apache#4238 will re-enable it.
@soenkeliebau
Copy link
Contributor Author

Looking through the log I was unsure whether we actually got a different low watermark or just ran into the timeout for waitUntilTrue because Exceptions kept us from ever getting to the actual check.

I'll include the returned value in the error message - we probably need to reset it before the second run, as it is still set to 5 from the first test - so if we run into the timeout due to exceptions the message would read "expected 5 but got 5" which might not help much :)
I'll look into setting an unitialized value so we can test for that and print a specific error message that allows to differentiate between "got a wrong value" and "got no value at all".

@ijuma
Copy link
Member

ijuma commented Nov 24, 2017

Take a look at TestUtils.computeUntilTrue if waitUntilTrue makes it difficult to include the value in the error.

@soenkeliebau
Copy link
Contributor Author

I've added some detail to the error message and reverted the catch back to just the two leader related exceptions.
It's not what I'd call beautiful code, there is probably some nice scala-y way of doing the check for two exceptions, but I couldn't come up with something nicer after googling a while tbh.
Let me know what you think of the extended error message.

asfgit pushed a commit that referenced this pull request Nov 24, 2017
It's failing often and it seems like there are multiple
reasons. PR #4238 will re-enable it.

Author: Ismael Juma <[email protected]>

Reviewers: Rajini Sivaram <[email protected]>

Closes #4262 from ijuma/temporarily-disable-test-log-start-offset-checkpoint
…l occasionally, this will instead fall back to the wrapping waitUntilTrue timeout. Also added unnesting of exceptions from ExecutionException that was originally missing and put the retrieved value for lowWatermark in the fail message for better readability in case of test failure.
@ijuma
Copy link
Member

ijuma commented Dec 20, 2017

retest this please

1 similar comment
@soenkeliebau
Copy link
Contributor Author

retest this please

@soenkeliebau
Copy link
Contributor Author

Hmm..it worked fine for two consecutive runs now, but a little while ago the JDK7 test consistently failed with the same code. Not sure if something else was changed (couldn't find anything, but only spent five minutes looking, so may very well have missed it) or if this was caused by something else entirely. I could never reproduce the test failure locally but it looked like ssl issues in the test log back when it failed, could this have been caused by load on the build server and some component taking too long to become ready?

@soenkeliebau
Copy link
Contributor Author

retest this please

@soenkeliebau
Copy link
Contributor Author

It seems to be back in working order for now. Unless someone has any idea of what might have caused the jdk7 tests to consistently fail back in November.
Are there any other comments on the PR as such? Otherwise I'd suggest merging and keeping an eye on whether the tests starts acting up again.

@soenkeliebau
Copy link
Contributor Author

retest this please

3 similar comments
@hachikuji
Copy link
Contributor

retest this please

@soenkeliebau
Copy link
Contributor Author

retest this please

@soenkeliebau
Copy link
Contributor Author

retest this please

Copy link
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay on review. Left a couple minor comments. Once addressed, I'm inclined to merge and see if the test case stabilizes.

@@ -759,15 +758,17 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging {
// Need to retry if leader is not available for the partition
result = client.deleteRecords(Map(topicPartition -> RecordsToDelete.beforeOffset(0L)).asJava)

lowWatermark = Long.MinValue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think it would be cleaner to use an Option

val future = result.lowWatermarks().get(topicPartition)
try {
lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark()
lowWatermark = future.get().lowWatermark()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's still useful to pass a timeout to get() to avoid blocking longer than the timeout passed to waitUntil. We could use JTestUtils.DEFAULT_MAX_WAIT_MS or manually set a specific timeout.

nit: no need for parenthesis after lowWatermark.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that requestTimeout is 20 seconds, so I think get is OK.

case e: ExecutionException if e.getCause.isInstanceOf[LeaderNotAvailableException] ||
e.getCause.isInstanceOf[NotLeaderForPartitionException] => false
}
}, "Expected low watermark of the partition to be 5 but got ".concat(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a bit more idiomatic to use string interpolation. If we make lowWatermark an option, then we can simplify this to just

s"Expected low watermark of the partition to be 5 but got $lowWatermark"

@soenkeliebau
Copy link
Contributor Author

Thanks for your comments @hachikuji
I have hopefully addressed them in the last commit. I've changed your proposed log message a little, but I think it still looks cleaner than before.

var lowWatermark = result.lowWatermarks().get(topicPartition).get().lowWatermark()
assertEquals(5L, lowWatermark)
var lowWatermark = Option(result.lowWatermarks.get(topicPartition).get.lowWatermark)
assertTrue(lowWatermark.contains(5L))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit better to use assertEquald(Some(5), lowWatermark) so that you get a good error message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll change that once Jason has chimed in so that I can address his comments as well if he has any.

Copy link
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM. Just had a minor nitpick.

@@ -743,8 +742,8 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging {

sendRecords(producers.head, 10, topicPartition)
var result = client.deleteRecords(Map(topicPartition -> RecordsToDelete.beforeOffset(5L)).asJava)
var lowWatermark = result.lowWatermarks().get(topicPartition).get().lowWatermark()
assertEquals(5L, lowWatermark)
var lowWatermark = Option(result.lowWatermarks.get(topicPartition).get.lowWatermark)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we use Some? Same below.

Copy link
Contributor Author

@soenkeliebau soenkeliebau Apr 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially used Some(..) but subsequently learned that Some(null) is not None, however Option(null) is None and should be preferred so thought that it is better to use Option here (not that I'd expect lowWatermark to return null, but you never really know).

But I really just read about this today, so this might be misguided :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense, but in this case lowWatermark is a long which cannot be null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, I'll push a commit with this fixed and @ijuma 's concern addressed.

@hachikuji
Copy link
Contributor

Thanks for the updates. I'll merge once the builds complete (assuming no problems).

@hachikuji hachikuji merged commit 886daf5 into apache:trunk Apr 12, 2018
ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request Jul 6, 2018
…transient failures (apache#4238)

Removed timeout from get call that caused the test to fail occasionally, this will instead fall back to the wrapping waitUntilTrue timeout. Also added unnesting of exceptions from ExecutionException that was originally missing and put the retrieved value for lowWatermark in the fail message for better readability in case of test failure.

Reviewers: Ismael Juma <[email protected]>, Jason Gustafson <[email protected]>
mkedwards pushed a commit to bitpusherllc/kafka that referenced this pull request Feb 11, 2019
…transient failures (apache#4238)

Removed timeout from get call that caused the test to fail occasionally, this will instead fall back to the wrapping waitUntilTrue timeout. Also added unnesting of exceptions from ExecutionException that was originally missing and put the retrieved value for lowWatermark in the fail message for better readability in case of test failure.

Reviewers: Ismael Juma <[email protected]>, Jason Gustafson <[email protected]>
mkedwards pushed a commit to nexiahome-orig/kafka that referenced this pull request Jul 9, 2019
…transient failures (apache#4238)

Removed timeout from get call that caused the test to fail occasionally, this will instead fall back to the wrapping waitUntilTrue timeout. Also added unnesting of exceptions from ExecutionException that was originally missing and put the retrieved value for lowWatermark in the fail message for better readability in case of test failure.

Reviewers: Ismael Juma <[email protected]>, Jason Gustafson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants